351 research outputs found
Generating Multi-Categorical Samples with Generative Adversarial Networks
We propose a method to train generative adversarial networks on multivariate
feature vectors representing multiple categorical values. In contrast to the
continuous domain, where GAN-based methods have delivered considerable
results, GANs struggle to perform equally well on discrete data. We propose
and compare several architectures based on multiple (Gumbel) softmax output
layers that take the structure of the data into account. We evaluate the
performance of our architectures on datasets with different sparsity,
numbers of features, ranges of categorical values, and dependencies among
the features. Our proposed architecture and method outperform existing
models.
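The (Gumbel) softmax relaxation that underlies such output layers can be
sketched in a few lines of NumPy. This is a minimal illustration of the
sampling trick only, not the authors' architecture; the batch size, head
sizes, and temperature below are illustrative assumptions.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Draw a differentiable 'soft' one-hot sample from a categorical
    distribution parameterized by unnormalized log-probabilities."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(1e-12, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # stable softmax
    return y / y.sum(axis=-1, keepdims=True)

# One softmax head per categorical feature: a record with a 3-valued and
# a 4-valued feature gets two heads, matching the structure of the data.
rng = np.random.default_rng(0)
heads = [rng.normal(size=(5, 3)), rng.normal(size=(5, 4))]  # batch of 5
samples = [gumbel_softmax(h, tau=0.5, rng=rng) for h in heads]
```

As the temperature `tau` approaches zero, each sample approaches a hard
one-hot vector while remaining differentiable with respect to the logits.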
Impact of Biases in Big Data
The underlying paradigm of big data-driven machine learning reflects the
desire to derive better conclusions by simply analyzing more data, without
the need to consider theory and models. But is simply having more data
always helpful? In 1936, The Literary Digest collected 2.3M completed
questionnaires to predict the outcome of that year's US presidential
election. The outcome of this big data prediction proved to be entirely
wrong, whereas George Gallup needed only 3K handpicked people to make an
accurate prediction. Generally, biases occur in machine learning whenever
the distributions of the training set and the test set differ. In this work,
we review different sorts of biases in (big) data sets in machine learning.
We provide definitions and discussions of the biases that appear most
commonly in machine learning: class imbalance and covariate shift. We also
show how these biases can be quantified and corrected. This work is an
introductory text intended to make both researchers and practitioners more
aware of this topic and thus help them derive more reliable models for their
learning problems.
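The standard correction for covariate shift, importance weighting, can be
illustrated with a toy example in which both input densities are known. The
Gaussian train/test distributions and the labeling function below are
illustrative assumptions, chosen only to make the re-weighting visible.

```python
import numpy as np

rng = np.random.default_rng(42)

# Covariate shift: train and test inputs come from different distributions,
# while the labeling function y = f(x) stays the same.
x_train = rng.normal(0.0, 1.0, 5000)   # p_train = N(0, 1)
x_test  = rng.normal(1.0, 1.0, 5000)   # p_test  = N(1, 1)

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weights w(x) = p_test(x) / p_train(x) re-weight the training
# sample so that weighted averages estimate test-distribution expectations.
w = gauss(x_train, 1.0, 1.0) / gauss(x_train, 0.0, 1.0)

f = lambda x: x ** 2                          # any fixed labeling function
naive    = f(x_train).mean()                  # biased estimate of E_test[f]
weighted = np.average(f(x_train), weights=w)  # importance-weighted estimate
target   = f(x_test).mean()
```

In practice the density ratio is unknown and is itself estimated, e.g. with
a classifier that discriminates training from test inputs, but the weighted
average above is the quantity such estimates feed into.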
PHom-GeM: Persistent Homology for Generative Models
Generative neural network models, including the Generative Adversarial
Network (GAN) and Auto-Encoders (AE), are among the most popular neural
network models for generating adversarial data. The GAN model is composed of
a generator that produces synthetic data and of a discriminator that
discriminates between the generator's output and the true data. AEs consist
of an encoder, which maps the model distribution to a latent manifold, and
of a decoder, which maps the latent manifold to a reconstructed
distribution. However, generative models are known to produce chaotically
scattered reconstructed distributions during training and, consequently,
incomplete generated adversarial distributions.
Current distance measures fail to address this problem because they are not
able to acknowledge the shape of the data manifold, i.e. its topological
features, and the scale at which the manifold should be analyzed. We propose
Persistent Homology for Generative Models, PHom-GeM, a new methodology to
assess and measure the distribution of a generative model. PHom-GeM minimizes
an objective function between the true and the reconstructed distributions and
uses persistent homology, the study of the topological features of a space at
different spatial resolutions, to compare the nature of the true and the
generated distributions. Our experiments underline the potential of persistent
homology for Wasserstein GAN in comparison to Wasserstein AE and Variational
AE. The experiments are conducted on a real-world data set particularly
challenging for traditional distance measures and generative neural network
models. PHom-GeM is the first methodology to propose a topological distance
measure, the bottleneck distance, for generative models, used to compare
adversarial samples in the context of credit card transactions.
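In dimension zero, the topological features tracked by persistent homology
are connected components, and their death times coincide with the edge
weights of a minimum spanning tree of the point cloud. The pure-Python
sketch below computes exactly that with Kruskal's algorithm and a
union-find; it is a didactic fragment, not the authors' pipeline, which also
involves higher-dimensional features and the bottleneck distance.

```python
import numpy as np
from itertools import combinations

def h0_persistence(points):
    """Death times of 0-dimensional features (connected components) of a
    point cloud: the minimum spanning tree edge weights, via Kruskal."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:             # two components merge: one feature dies
            parent[ri] = rj
            deaths.append(d)
    return deaths                # n - 1 deaths; one component never dies

# Two well-separated clusters: two short-lived features inside the
# clusters and one long-lived feature killed by the gap between them.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
deaths = h0_persistence(pts)
```

The large gap between clusters shows up as one large death value, which is
the kind of shape information ordinary distance measures ignore.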
Predicting Sparse Clients' Actions with CPOPT-Net in the Banking Environment
The digital revolution of the banking system, together with evolving
European regulations, has pushed major banking actors to innovate through
new uses of their clients' digital information. Given highly sparse client
activities, we propose CPOPT-Net, an algorithm that combines the CP
canonical tensor decomposition, a multidimensional matrix decomposition that
factorizes a tensor as a sum of rank-one tensors, with neural networks.
CPOPT-Net handles sparse information efficiently with a gradient-based
resolution while relying on neural networks for time series predictions. Our
experiments show that CPOPT-Net is capable of accurately predicting clients'
actions in the context of personalized recommendation. CPOPT-Net is the
first algorithm to use non-linear conjugate gradient tensor resolution with
neural networks to predict financial activities on a public data set.
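The CP decomposition at the heart of such an approach can be sketched for
the rank-one case with alternating least squares in NumPy. This is a
simple stand-in for the non-linear conjugate gradient solver the abstract
refers to; the tensor and factor sizes below are illustrative.

```python
import numpy as np

def cp_rank1(T, n_iter=50):
    """Best rank-one CP approximation a x b x c of a 3-way tensor T,
    fitted by alternating least squares: each factor update is the
    closed-form least-squares solution with the other two held fixed."""
    rng = np.random.default_rng(0)
    b = rng.normal(size=T.shape[1])
    c = rng.normal(size=T.shape[2])
    for _ in range(n_iter):
        a = np.einsum('ijk,j,k->i', T, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', T, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', T, a, b) / ((a @ a) * (b @ b))
    return a, b, c

# A tensor that is exactly rank one is recovered (up to rescaling of the
# individual factors, which cancels in the outer product).
a0, b0, c0 = np.arange(1, 4.0), np.arange(1, 5.0), np.arange(1, 3.0)
T = np.einsum('i,j,k->ijk', a0, b0, c0)
a, b, c = cp_rank1(T)
T_hat = np.einsum('i,j,k->ijk', a, b, c)
```

A rank-R decomposition sums R such rank-one terms; sparse data enters by
restricting the least-squares fit to the observed entries.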
Human in the Loop: Interactive Passive Automata Learning via Evidence-Driven State-Merging Algorithms
We present an interactive version of an evidence-driven state-merging (EDSM)
algorithm for learning variants of finite state automata. Learning these
automata often amounts to recovering or reverse engineering the model
generating the data despite noisy, incomplete, or imperfectly sampled data
sources rather than optimizing a purely numeric target function. Domain
expertise and human knowledge about the target domain can guide this process,
and typically is captured in parameter settings. Often, domain expertise is
subconscious and not expressed explicitly. Directly interacting with the
learning algorithm makes it easier to utilize this knowledge effectively.Comment: 4 pages, presented at the Human in the Loop workshop at ICML 201
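The evidence step of an EDSM-style algorithm can be sketched briefly: build
a prefix tree acceptor from labeled strings, then score a candidate state
merge by counting the identically labeled states it folds together,
rejecting merges that contradict the data. The toy alphabet and labels
below are illustrative, not taken from the paper.

```python
# Minimal sketch of evidence-driven merge scoring on a prefix tree.

class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.label = None    # True = accept, False = reject, None = unknown

def build_pta(samples):
    """Prefix tree acceptor: one state per prefix of the labeled strings."""
    root = Node()
    for string, accept in samples:
        node = root
        for sym in string:
            node = node.children.setdefault(sym, Node())
        node.label = accept
    return root

def merge_score(p, q):
    """Evidence gained by merging states p and q: +1 per pair of
    identically labeled states folded together, None on a conflict."""
    score = 0
    if p.label is not None and q.label is not None:
        if p.label != q.label:
            return None          # merging would contradict the data
        score += 1
    for sym, child in q.children.items():
        if sym in p.children:
            sub = merge_score(p.children[sym], child)
            if sub is None:
                return None
            score += sub
    return score

pta = build_pta([("aa", True), ("ab", False), ("ba", True), ("bb", False)])
# Candidate merge: the states reached by 'a' and by 'b' behave identically,
# so merging them is supported by two pieces of evidence.
score = merge_score(pta.children["a"], pta.children["b"])
```

An interactive variant lets the domain expert accept, reject, or reorder
the highest-scoring merges instead of applying them greedily.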
The Art of The Scam: Demystifying Honeypots in Ethereum Smart Contracts
Modern blockchains, such as Ethereum, enable the execution of so-called smart
contracts - programs that are executed across a decentralised network of nodes.
As smart contracts become more popular and carry more value, they become more
of an interesting target for attackers. In the past few years, several smart
contracts have been exploited by attackers. However, a new trend towards a more
proactive approach seems to be on the rise, where attackers do not search for
vulnerable contracts anymore. Instead, they try to lure their victims in by
deploying seemingly vulnerable contracts that contain hidden traps.
This new type of contract is commonly referred to as a honeypot. In this paper,
we present the first systematic analysis of honeypot smart contracts, by
investigating their prevalence, behaviour and impact on the Ethereum
blockchain. We develop a taxonomy of honeypot techniques and use this to build
HoneyBadger - a tool that employs symbolic execution and well-defined
heuristics to expose honeypots. We perform a large-scale analysis on more than
2 million smart contracts and show that our tool not only achieves high
precision, but is also highly efficient. We identify 690 honeypot smart
contracts as well as 240 victims in the wild, with an accumulated profit of
more than $90,000 for the honeypot creators. Our manual validation shows that
87% of the reported contracts are indeed honeypots.
Improving Missing Data Imputation with Deep Generative Models
Datasets with missing values are very common in industrial applications, and
they can have a negative impact on machine learning models. Recent studies
introduced solutions to the problem of imputing missing values based on deep
generative models. Previous experiments with Generative Adversarial Networks
and Variational Autoencoders showed interesting results in this domain, but it
is not clear which method is preferable for different use cases. The goal of
this work is twofold: we present a comparison between missing data imputation
solutions based on deep generative models, and we propose improvements over
those methodologies. We run our experiments on well-known real-life datasets
with different characteristics, removing values at random and reconstructing
them with several imputation techniques. Our results show that the presence
or absence of categorical variables can alter the selection of the best
model, and that some models are more stable than others across repeated runs
with different random number generator seeds.
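The evaluation protocol described above can be sketched directly: take a
complete data set, remove values completely at random, impute, and score
the imputations against the held-out ground truth. Mean imputation serves
as the baseline here; a generative imputer would plug into the same loop.
The synthetic data and the 20% missingness rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)

# Complete ground-truth data (three numeric columns with different scales).
X = rng.normal(loc=[0.0, 5.0, -2.0], scale=[1.0, 2.0, 0.5], size=(500, 3))

# Remove values completely at random (MCAR) and record where.
mask = rng.uniform(size=X.shape) < 0.2          # True = value removed
X_obs = np.where(mask, np.nan, X)

def mean_impute(X_obs):
    """Baseline imputer: fill each missing entry with its column mean."""
    col_means = np.nanmean(X_obs, axis=0)
    return np.where(np.isnan(X_obs), col_means, X_obs)

X_hat = mean_impute(X_obs)
# Score only the entries that were actually removed.
rmse = np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2))
```

Comparing imputers then reduces to swapping `mean_impute` for another
method and repeating the loop over several random seeds, which is how the
stability differences mentioned above become visible.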
MQLV: Optimal Policy of Money Management in Retail Banking with Q-Learning
Reinforcement learning has become one of the best approaches to train a
computer game emulator capable of human-level performance. In a
reinforcement learning approach, an optimal value function is learned across
a set of actions, or decisions, that leads to a set of states giving
different rewards, with the objective of maximizing the overall reward. A
policy assigns to each state-action pair an expected return. An optimal
policy is a policy for which the value function is optimal. QLBS, Q-Learner in the
Black-Scholes(-Merton) Worlds, applies the reinforcement learning concepts, and
noticeably, the popular Q-learning algorithm, to the financial stochastic model
of Black, Scholes and Merton. It is, however, specifically optimized for the
geometric Brownian motion and the vanilla options. Its range of application is,
therefore, limited to vanilla option pricing within financial markets. We
propose MQLV, Modified Q-Learner for the Vasicek model, a new reinforcement
learning approach that determines the optimal policy of money management based
on the aggregated financial transactions of the clients. It unlocks new
frontiers to establish personalized credit card limits or to fulfill bank loan
applications, targeting the retail banking industry. MQLV extends the
simulation to mean reverting stochastic diffusion processes and it uses a
digital function, a Heaviside step function expressed in its discrete form, to
estimate the probability of a future event such as a payment default. In our
experiments, we first show the similarities between a set of historical
financial transactions and Vasicek generated transactions and, then, we
underline the potential of MQLV on generated Monte Carlo simulations. Finally,
MQLV is the first Q-learning Vasicek-based methodology addressing
transparent decision-making processes in retail banking.
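The two ingredients named above, a mean-reverting Vasicek diffusion and a
discrete Heaviside step used to estimate an event probability, can be
sketched with a plain Monte Carlo simulation. The parameter values and the
barrier below are illustrative assumptions, not the paper's calibration.

```python
import numpy as np

def vasicek_paths(x0, kappa, theta, sigma, T, n_steps, n_paths, rng):
    """Terminal values of Euler-discretized paths of the mean-reverting
    Vasicek diffusion dx_t = kappa * (theta - x_t) dt + sigma dW_t."""
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        x += kappa * (theta - x) * dt \
             + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
    return x

rng = np.random.default_rng(1)
x_T = vasicek_paths(x0=0.0, kappa=2.0, theta=1.0, sigma=0.3,
                    T=1.0, n_steps=250, n_paths=20000, rng=rng)

# Discrete Heaviside step on the terminal values: the Monte Carlo average
# of the indicator estimates the probability of the event, e.g. a payment
# default proxied by the process crossing a barrier.
barrier = 0.5
p_event = np.mean(np.heaviside(x_T - barrier, 1.0))  # P(x_T >= barrier)
```

The mean reversion pulls the paths toward `theta`, which is what makes the
Vasicek model a closer fit to aggregated transaction balances than the
geometric Brownian motion assumed by QLBS.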